Figure 1. Map showing the collection locations of the 465 specimens available after basic initial filtering. Colors represent the “population cluster” assigned by the tree method (tree length = 10 km). Hover over the dots to see which cluster they were assigned to!
This table shows several problems with the colony assignments, such as colonies whose members span multiple clusters (so really long distances) or multiple years (so they can’t be siblings).
This table similarly shows how many colonies are affected (singletons excluded).
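For reference, the consistency check behind these two tables can be sketched roughly as below. This is a minimal Python illustration, not the code that produced the report: the record layout, the colony names, and the `flag_colonies` helper are all hypothetical, and the values are made up.

```python
from collections import defaultdict

# Hypothetical records: (specimen_id, colony, cluster, year). Values are made up.
records = [
    ("S1", "col_A", 1, 2019),
    ("S2", "col_A", 1, 2019),
    ("S3", "col_B", 1, 2019),
    ("S4", "col_B", 4, 2019),  # same colony, different cluster
    ("S5", "col_C", 2, 2018),
    ("S6", "col_C", 2, 2020),  # same colony, different year
    ("S7", "col_D", 3, 2019),  # singleton, ignored below
]

def flag_colonies(records):
    """Flag non-singleton colonies whose members span >1 cluster or >1 year."""
    members = defaultdict(list)
    for _specimen, colony, cluster, year in records:
        members[colony].append((cluster, year))

    flags = {}
    for colony, rows in members.items():
        if len(rows) < 2:
            continue  # singletons cannot be inconsistent with themselves
        clusters = {cluster for cluster, _ in rows}
        years = {year for _, year in rows}
        flags[colony] = {
            "multi_cluster": len(clusters) > 1,
            "multi_year": len(years) > 1,
        }
    return flags

flags = flag_colonies(records)
n_flagged = sum(1 for f in flags.values() if f["multi_cluster"] or f["multi_year"])
print(flags)
print(n_flagged, "of", len(flags), "non-singleton colonies flagged")
```

Keeping the two flags separate makes it easy to see whether the cross-cluster or the cross-year problem is the more common one.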
Because the output of COLONY is so suspect at the moment, I wanted to proceed without excluding siblings. There’s debate as to whether or not this is “valid”. The basic idea is that you exclude siblings since they represent a biased subset of genetic variability…but if you’re sampling randomly and there are a lot of siblings in the data, then it’s possibly because there’s not a lot of variability in the population, so you should expect to encounter (and include) siblings a lot.
In any event, the following analysis uses the same specimens as the COLONY run: females, from unknown colonies, with at least 10 loci of data.
Maybe for now this is more for the “sake of argument” until we chase down the underlying issues with COLONY…or maybe this will help reveal why COLONY is struggling…
There are no loci with <80% completeness.
## named numeric(0)
There are no individuals with poor data in the dataset…but that is because I also excluded them earlier in the process. So this is just a sanity check.
## named numeric(0)
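The two completeness checks above (per locus and per individual) boil down to something like the sketch below. It is in Python purely for illustration and is not the code behind this report; the genotypes are made up, `None` stands for a missing genotype, and the helper names are hypothetical. The report's real thresholds are 80% locus completeness and at least 10 typed loci per individual; the toy data has only three loci, so the example call uses a lower `min_loci`.

```python
# Hypothetical toy data: individual -> one genotype per locus, None = missing.
genotypes = {
    "ind1": [(150, 152), (200, 200), None],
    "ind2": [(150, 150), (200, 204), (98, 100)],
    "ind3": [(152, 152), (204, 204), None],
    "ind4": [(150, 152), None, None],
}

def locus_completeness(genotypes):
    """Fraction of individuals with a genotype call at each locus."""
    rows = list(genotypes.values())
    n_loci = len(rows[0])
    return [sum(row[i] is not None for row in rows) / len(rows) for i in range(n_loci)]

def typed_loci_per_individual(genotypes):
    """Number of loci with a genotype call for each individual."""
    return {ind: sum(g is not None for g in row) for ind, row in genotypes.items()}

def poor_loci(genotypes, min_completeness=0.80):
    """Indices of loci below the completeness threshold."""
    return [i for i, c in enumerate(locus_completeness(genotypes)) if c < min_completeness]

def poor_individuals(genotypes, min_loci=10):
    """Individuals typed at fewer than min_loci loci."""
    typed = typed_loci_per_individual(genotypes)
    return sorted(ind for ind, n in typed.items() if n < min_loci)

completeness = locus_completeness(genotypes)
low_loci = poor_loci(genotypes)                      # 80% threshold, as in the report
low_inds = poor_individuals(genotypes, min_loci=2)   # toy threshold, only 3 loci here
print(completeness, low_loci, low_inds)
```

An empty result from both checks is the equivalent of the empty output shown above.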
The multilocus genotype (MLG) count below suggests there might be a couple of clones/duplicates in here: 307 individuals but only 304 MLGs. This is possible because of low genetic variability…or simply because someone got double counted. Explore this more. The three that are clones have adjacent specimen IDs (e.g. BAFF429 and BAFF430), so this is potentially from cross-contamination, or simply because they were caught from the same population.
## #############################
## # Number of Individuals: 307
## # Number of MLG: 304
## #############################
## [1] 304
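To chase down which specimens share a genotype, the grouping can be sketched as below. This is a naive Python illustration with made-up specimen IDs and genotypes, and `group_by_genotype` is a hypothetical helper; it assumes complete data, so how missing calls should be treated when comparing genotypes is left open.

```python
from collections import defaultdict

# Hypothetical toy data: specimen -> one genotype per locus (no missing calls).
genotypes = {
    "X001": [(150, 152), (200, 204)],
    "X002": [(152, 150), (204, 200)],  # same genotype as X001, alleles in other order
    "X003": [(150, 150), (200, 204)],
    "X004": [(152, 152), (204, 204)],
}

def group_by_genotype(genotypes):
    """Group specimens that share an identical multilocus genotype."""
    groups = defaultdict(list)
    for specimen, loci in genotypes.items():
        # Sort alleles within each locus so (150, 152) and (152, 150) match.
        key = tuple(tuple(sorted(locus)) for locus in loci)
        groups[key].append(specimen)
    return groups

groups = group_by_genotype(genotypes)
n_individuals = len(genotypes)
n_mlg = len(groups)
duplicates = sorted(sorted(members) for members in groups.values() if len(members) > 1)
print(n_individuals, "individuals,", n_mlg, "MLGs")
print("shared genotypes:", duplicates)
```

Listing the members of each shared genotype makes it quick to see whether the duplicates have adjacent IDs.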
Just a little sanity check that all 13 loci are polymorphic (they are).
## Mode TRUE
## logical 13
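The idea of the check is simply that a locus counts as polymorphic if more than one distinct allele is observed at it. A minimal Python sketch, with made-up locus names and genotypes and a hypothetical `is_polymorphic` helper (not the function used for the output above):

```python
# Hypothetical toy data: locus -> genotypes across individuals, None = missing.
loci = {
    "locA": [(150, 152), (150, 150), (152, 152)],
    "locB": [(200, 200), (200, 200), None],
}

def is_polymorphic(locus_genotypes):
    """True if more than one distinct allele is observed at the locus."""
    alleles = {allele for g in locus_genotypes if g is not None for allele in g}
    return len(alleles) > 1

polymorphic = {name: is_polymorphic(gts) for name, gts in loci.items()}
print(polymorphic)
```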
Table of various summary stats
Only displaying populations with 5 or more specimens.
Figure 2. Observed (blue) versus expected (grey) heterozygosity across the regions. Numbers above bars are the sample size for each region.
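As a reminder of what the two bars mean: observed heterozygosity is the fraction of typed individuals that are heterozygous at a locus, and expected heterozygosity is one minus the sum of squared allele frequencies. The Python sketch below shows this for a single locus with made-up genotypes. Whether the figure uses the small-sample correction (the `unbiased` option here) is an assumption I have not checked, and per-region values would normally be averaged over loci.

```python
from collections import Counter

def heterozygosity(locus_genotypes, unbiased=False):
    """Return (observed, expected) heterozygosity for one diploid locus.

    Missing genotypes (None) are skipped. With unbiased=True, the expected
    value is multiplied by 2n / (2n - 1), where n is the number of typed
    individuals.
    """
    typed = [g for g in locus_genotypes if g is not None]
    n = len(typed)
    if n == 0:
        raise ValueError("no typed individuals at this locus")

    observed = sum(1 for a, b in typed if a != b) / n

    allele_counts = Counter(allele for g in typed for allele in g)
    total = 2 * n  # two allele copies per diploid individual
    expected = 1.0 - sum((count / total) ** 2 for count in allele_counts.values())
    if unbiased:
        expected *= total / (total - 1)
    return observed, expected

# Hypothetical genotypes at one locus for one region.
toy_locus = [(150, 150), (150, 152), (152, 152), (150, 152), None]
ho, he = heterozygosity(toy_locus)
_, he_unbiased = heterozygosity(toy_locus, unbiased=True)
print(ho, he, he_unbiased)
```

Observed values well below expected would point toward a heterozygote deficit, which is worth keeping in mind given the sibling and clustering questions above.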
Same data as Figure 2, but in table format (and for ALL clusters, not just those with >=5 specimens) so you can investigate to your heart’s content. Added some math as a little helper.
This is just kinda trash for now. We need to determine how we really wanna cluster things. The current approach results in way too many clusters to be intelligible, so I haven’t really worked on getting it clean.
Below is the same thing with only populations having >=5 specimens.